Abstract

We analyze the online response of the scientific community to the preprint publication of
scholarly articles. We employ a cohort of 4,606 scientific articles submitted to the preprint
database arXiv.org between October 2010 and April 2011. We study three forms of reactions to
these preprints: how they are downloaded on the arXiv.org site, how they are mentioned on the
social media site Twitter, and how they are cited in the scholarly record. We perform two
analyses. First, we analyze the delay and time span of article downloads and Twitter mentions
following submission, to understand the temporal configuration of these reactions and whether
significant differences exist between them. Second, we run correlation tests to investigate the
relationship between Twitter mentions and both article downloads and article citations. We find
that Twitter mentions follow rapidly after article submission and that they are correlated with
later article downloads and later article citations, indicating that social media may be an
important factor in determining the scientific impact of an article.

1 Introduction

Online social media, such as social networking and micro-blogging environments, have become 
a crucial component of public discourse. Scholars are therefore becoming increasingly interested 
in leveraging user-generated data on social media platforms to study a multitude of social, 
economic, and political phenomena. For example, social media data have recently been employed 
to infer social ties based on geographical proximity [1], study how diurnal and seasonal mood
patterns match up with the effects of sleep and circadian rhythms [2], infer socio-economic
patterns from public mood levels [3], validate Dunbar's number [4], and even explore how online
chatter can lead to offline forms of political organization and mobilization [5]. Granted that
social media discourse has a profound effect on these different aspects of everyday life, how is it 
affecting scholarly communication? 

The view from the "ivory tower" is that scholars make rational, expert decisions on what 
to publish, what to read and what to cite. In fact, the use of citation statistics to assess 
scholarly impact is to a large degree premised on the notion that citation data represent an 
explicit, objective expression of impact by expert authors [6]. Yet, scholars do not operate in an
online vacuum. Download statistics have received increasing attention as a means to gain a fuller
understanding of scholarly impact as manifested in the actual online behavior of scholars. In fact,
recent results indicate that usage data can indeed provide fast, reliable and valid indicators of
scholarly impact and can even underpin scholarly recommendation services [7-9]. Interestingly,
it has been found that download statistics can predict future citation impact [10], suggesting
that online readership may be an important factor in driving scholarly citations.

As an extension of this line of work, recent efforts have explored the effect of the use of 
social media environments on scholarly practice. For example, some research has looked at 
how scientists use the microblogging platform Twitter during conferences by analyzing tweets
containing conference hashtags [11,12]. Other research has explored the ways in which scholars
use Twitter and related platforms to cite scientific articles [13,14]. More recent work has shown
that Twitter article mentions predict future citations [15]. This article falls within, and extends,
these lines of research by examining the temporal relations between quantitative measures of
readership, Twitter mentions, and subsequent citations for a cohort of scientific preprints.

In particular, we define as the starting point a cohort of 4,606 scientific articles submitted
to the preprint database arXiv (http://arxiv.org) between October 2010 and April 2011.
ArXiv.org, a service managed by Cornell University Library, has become the premier preprint
publishing platform in physics, computer science, astronomy, and related domains. We study 
how the scientific community and the public at large respond to the online submission of these 
preprints by inspecting three forms of reactions. First, we calculate readership as measured 
in the number and temporal distribution of downloads of these articles on the arXiv.org site. 
Second, we explore the reaction of the social media, as measured in the number and temporal 
distribution of Twitter posts (tweets) that specifically mention those articles. Third, we look at 
the "official" reaction of the scientific community, as measured in the number of citations that 
a portion of preprints in the corpus received in the months immediately after their submission. 

We employ these three forms of impact statistics to perform two analyses. First, we investigate
the temporal relationships that exist between preprint downloads and Twitter mentions.
We study both the temporal delay and span of these reactions. The delay is the time difference
between the date of an arXiv submission and a subsequent spike in downloads or Twitter mentions.
The time span is measured as the time between the first and the last reaction (download
or mention) for the article in question. We address two questions: How long does it take for
an article to receive its maximum volume of downloads and Twitter mentions (time delay)?
And how long does that response activity last (time span)? Second, we investigate whether a
correlation exists between how popular an article is on social media, as measured in the volume
of Twitter mentions, and how much it gets downloaded and cited in the scholarly record. We are
concerned with the question of whether the increasing role of Twitter and other social networking
environments in the scholarly community can affect citation- or usage-based indicators of
scholarly impact. In other words, we ask the question: Is there a correlation between the volume of
Twitter mentions and the downloads and citations an article receives?

In the following sections we outline the details of our data and study methods, which are
followed by a detailed analysis and discussion of the results. Our results show that download
and social media responses follow distinct temporal patterns. Moreover, we observe a statistically
significant correlation between social media mentions and download and citation counts. These
results are highly relevant to recent investigations of scholarly impact based on social media
data [16,17] as well as to more traditional efforts to enhance the assessment of scholarly impact
from usage data.

2 Data and study overview

Our analysis is based on a corpus of 4,606 scientific articles submitted to the preprint database 
arXiv between October 1, 2010 and April 30, 2011. For this corpus of articles, we gathered 
information relative to their arXiv readership, Twitter mentions, and citations in the scholarly
record, as follows: 

1. Downloads: The weekly number of unique downloads of each article in the corpus was
computed from the arXiv logs.

2. Twitter mentions: We define a Twitter mention as a tweet whose content includes an explicit
or shortened link to an arXiv paper. We scanned a total of 1,959,654,862 tweets
posted to Twitter between October 1, 2010 and April 30, 2011, finding that 4,415 arXiv
articles from the corpus were mentioned on Twitter during this period (i.e., around 95% of
articles posted on arXiv receive some Twitter coverage). The volume of tweeting events,
however, was comparatively small, with only 5,752 tweets containing mentions of papers
in the arXiv corpus (the most "mentioned" paper was tweeted about 113 times).

3. Citations: From Google Scholar, we manually collected citation counts up to September 
30, 2011 for the top 100 most mentioned articles on Twitter in the corpus. The 100 most 
mentioned articles in the corpus accrued a total of 431 citations (the most cited article in 
the corpus was cited 62 times). 

Please refer to the Materials section for more details about our data collection and filtering 
methods, in particular on our definition of what constitutes a Twitter article mention. 

To illustrate our data collection and its relevance to the particular research question we
address in this paper, consider as an example one of the papers in the corpus whose arXiv ID
is 1010.3003, a preprint entitled "Twitter mood predicts the stock market", submitted to the arXiv
subject domain cs (Computer Science > Computational Engineering, Finance, and Science).
An excerpt of the timeline of this article (incomplete and purposely fictitious) is presented in
Figure 1. The timeline begins with the submission of the article, on October 14, 2010. The
article receives 47 Twitter mentions a few days after submission (fictitious data). During the
first week after submission, the article receives 1,530 unique downloads and the following week,
73,000 (also fictitious data). A couple of months after publication, in December, the article
receives its first citation in the scholarly record (again, fictitious data). The timeline of Figure 1
can be thought of as a stream of responses that a scientific article receives in the form of direct
readership (article downloads on arXiv), reactions in a social media environment (mentions on
Twitter), and citations in the scholarly record (as evinced by Google Scholar).

Figure 1. Stream of responses (arXiv downloads, Twitter mentions, and Google Scholar
citations) for arXiv preprint 1010.3003. All indicated figures and dates are fictitious and for
demonstration only.

By aggregating information about downloads, mentions, and citations in this fashion, we
compiled three datasets. Some descriptive statistics about these datasets are presented in Figure
2. The first row of plots in Figure 2 displays the arXiv subject domains of (a) downloaded, and
(b) Twitter mentioned papers (by percentage). A full list of the subject domain abbreviations
used in these plots is available in the Materials section, Table 4. We observe a broad and evenly
spread distribution of subject domains for downloads and mentions: most papers downloaded
and mentioned on Twitter relate to Physics, in particular Astrophysics, High Energy Physics,
and Mathematics. The second row of plots in Figure 2 displays the temporal distributions of
(c) downloads, and (d) Twitter mentions. As shown in Figure 2(c), download counts of articles
increase over time. This may be partly caused by a cumulative effect: papers that were published
earlier have had more time to accumulate reads than papers that were published later. Figure
2(d), however, shows that the ratio of the total number of tweets that mention arXiv papers vs.
all tweets decreases over time. This decrease may correspond to the overall increase in Twitter
traffic relative to the number of tweets that mention arXiv papers. Also, a few sudden drops
are caused by missing data during the crawling process.

In order to better understand how Twitter mentions vary across domains, we show the
Complementary Cumulative Distribution Functions (CCDF) of Twitter mentions for all articles in
the five most frequently observed subject domains in Figure 3. We find that within each
domain few papers receive relatively many mentions whereas the majority receive very few. The
frequency-rank distribution is thus strongly skewed towards low values, indicating that most
articles receive very few Twitter mentions. Note that we rely on the so-called Twitter Gardenhose,
a random sample of about 10% of all daily tweets, and may thus underestimate the absolute
number of Twitter mentions by a factor of 10. (Refer to the Materials section for more details.)

Figure 3. Complementary Cumulative Distribution Functions (CCDF) of Twitter mentions
for all articles in the 5 most frequently observed subject domains.
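
As an illustration of how a CCDF such as the one in Figure 3 can be obtained from per-article mention counts, consider the following minimal sketch. It assumes the convention CCDF(x) = P(X >= x); the text does not state which convention was used for the plots, and the function name `empirical_ccdf` and the sample counts are hypothetical.

```python
from collections import Counter


def empirical_ccdf(counts):
    """Return (x, P(X >= x)) pairs, sorted by x, for a list of counts.

    Assumption: the CCDF is defined with ">=", so the smallest observed
    value always maps to 1.0.
    """
    n = len(counts)
    freq = Counter(counts)
    ccdf = []
    remaining = n  # number of observations with value >= current x
    for x in sorted(freq):
        ccdf.append((x, remaining / n))
        remaining -= freq[x]
    return ccdf


# Hypothetical mention counts for seven articles in one subject domain.
mentions = [1, 1, 1, 1, 2, 3, 10]
ccdf = empirical_ccdf(mentions)
for x, p in ccdf:
    print(x, round(p, 3))
```

Plotting these pairs on log-log axes gives a curve of the kind described above: a skewed distribution shows up as a rapid drop at low x followed by a long tail.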

3 Results

3.1 Delay and span of downloads and Twitter response to article submission 

We study both the delay and the time span of Twitter mentions and download responses to 
arXiv submissions. The delay is measured as the time difference between the date of a preprint 
submission and a subsequent spike in Twitter mentions (the day in which an article receives the 
highest volume of related tweets) or arXiv downloads (the day in which it receives the highest 
volume of downloads). The time span is the temporal "duration" of the response, measured 
as the time lag between the first and the last Twitter mention or download of the article in 
question. 
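
The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `delay_and_span`, the tie-breaking rule (earliest peak day wins), and the sample counts are hypothetical, and the span is taken as a plain date difference, so a response confined to a single day has span 0 here, whereas the plots in this paper may count such a case as one day.

```python
from datetime import date


def delay_and_span(submitted, daily_counts):
    """Compute (delay, span) in days for one article.

    submitted    -- submission date of the preprint
    daily_counts -- dict mapping a date to the number of responses
                    (tweets or downloads) observed on that date
    """
    active = {d: c for d, c in daily_counts.items() if c > 0}
    if not active:
        return None  # the article received no response at all
    # Peak day: highest count; ties go to the earliest day (assumption).
    peak_day = min(active, key=lambda d: (-active[d], d))
    delay = (peak_day - submitted).days
    span = (max(active) - min(active)).days
    return delay, span


# Hypothetical response counts for a preprint submitted on 2010-10-14.
submitted = date(2010, 10, 14)
responses = {
    date(2010, 10, 15): 3,
    date(2010, 10, 18): 12,
    date(2010, 10, 21): 1,
}
print(delay_and_span(submitted, responses))
```

In this made-up example the peak falls four days after submission and the responses stretch over six days.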

In Figure 4 we plot the distributions of delay and time span of arXiv downloads and Twitter
mentions following article submission. In Figure 4(a), the delay curve shows that nearly all
articles take at least 5 days to reach the peak of arXiv downloads (when x <= 4, the corresponding
y value is 1, indicating that all articles take more than 4 days to reach the peak). In addition,
the time span curve shows that most of the articles are downloaded persistently for over 100
days (when x <= 100, the corresponding y value stays above 0.8).

From Figure 4(b) it emerges that nearly 50% of the articles in the corpus reach the peak
of Twitter mentions just one day after they are submitted (on the delay curve, when x = 2, the
corresponding y value is almost 0.5, indicating that around 50% of the articles take more than
one day to reach the peak) and over 80% of articles reach that peak within 4 days of submission
(when x = 4, the corresponding y value is less than 0.2). However, over 90% of arXiv.org articles are
mentioned on one day and one day only (on the time span curve, when x = 2, the corresponding
y value is below 0.1), i.e., one or multiple tweets about an article are posted within a 24-hour
window and the article is then never mentioned again. Overall, compared with arXiv downloads,
the Twitter response to scientific articles is typically swift, yet highly ephemeral, a pattern
indicative of the news of publication being passed around and very little in-depth discussion
taking place afterward.

The distributions of Figure 4 show that the delay and span for Twitter mentions and the delay
of arXiv downloads are highly skewed towards very low values, with very few cases characterized
by extensive delays or time spans. However, the span of arXiv downloads remains high for most
articles. It is interesting that all curves seem to exhibit multi-scale power-law properties, i.e.
their shape corresponds to $f(x) \propto x^{-\alpha}$ with different values of α observed within
particular ranges of x. This is particularly the case for the Twitter response in Fig. 4, which
seems to be characterized by 3 distinct regimes where the decay of both delay and time span
relative to x is somewhat consistent within specific ranges of x. This may indicate that the data
is shaped by similar processes, but occurring for different communities or samples (different
values of α). Since we cannot validly determine a value of α that applies to the entire range of x,
we manually identify a set of turning points in the curve where α seems to change, and estimate
the value of α between two turning points. We present some values of α for different ranges of x in
Table 1.
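
One common way to estimate a power-law exponent between two turning points is an ordinary least-squares fit of log(y) against log(x) over that range, where the negated slope gives the exponent. The sketch below illustrates this idea only; the text does not specify the estimation procedure actually used, and the function name `fit_alpha`, the turning point at x = 10, and the synthetic curve are all assumptions.

```python
import math


def fit_alpha(xs, ys, lo, hi):
    """Estimate alpha in y ~ x**(-alpha) for lo <= x <= hi.

    Uses ordinary least squares on (log x, log y); the exponent is the
    negated slope. Points with non-positive x or y are skipped.
    """
    pts = [(math.log(x), math.log(y))
           for x, y in zip(xs, ys)
           if lo <= x <= hi and x > 0 and y > 0]
    if len(pts) < 2:
        raise ValueError("need at least two points in the range")
    mean_x = sum(p[0] for p in pts) / len(pts)
    mean_y = sum(p[1] for p in pts) / len(pts)
    sxx = sum((p[0] - mean_x) ** 2 for p in pts)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pts)
    if sxx == 0:
        raise ValueError("all x values in the range are identical")
    return -sxy / sxx


# Synthetic two-regime curve (hypothetical): alpha = 0.5 up to x = 10,
# alpha = 2.0 beyond, joined continuously at the turning point.
xs = list(range(1, 51))
ys = [x ** -0.5 if x <= 10 else 10 ** -0.5 * (x / 10) ** -2.0 for x in xs]

alpha_low = fit_alpha(xs, ys, 1, 10)
alpha_high = fit_alpha(xs, ys, 10, 50)
print(round(alpha_low, 3), round(alpha_high, 3))
```

Fitting each range separately mirrors the manual turning-point approach described above: no single exponent is forced onto the entire range of x.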

Figure 4. Delay and span of (a) arXiv downloads and (b) Twitter response to arXiv
submissions.

Table 1. α values of delay and time span of arXiv downloads and Twitter mentions

arXiv downloads
    delay                         span
    range (days)      α           range (days)      α
    1 < x < 5         0.082       1 < x < 120       0.098
    5 < x < 10        1.768       120 < x < 150     2.206
    10 < x < 12       8.601
    12 < x < 200      0.475
    x > 200           3.022

Twitter mentions
    delay                         span
    range (days)      α           range (days)      α
    1 < x < 3         0.879       1 < x < 2         4.031
    3 < x < 6         2.336       2 < x < 85        0.754
    6 < x < 85        0.799       x > 85            2.232
    x > 85            2.645

In Table 2 we show the delay and span values for a sample of 8 articles in the corpus that
exhibit the highest number of Twitter mentions in the period under study. A very wide range
of delay and span values is recorded for both Twitter mentions and downloads. Yet, some
interesting patterns emerge. For example, articles with smaller Twitter mention delays seem to
have larger spans. In other words, when an article is quick to be noticed on Twitter after its
arXiv submission, the span of its social media response seems to be longer. As for downloads,
much longer and more evenly distributed time spans are observed for all eight articles. To
illustrate these effects, we examine the detailed response dynamics for the first article in Table 2.

Table 2. Delay and span of a sample of eight papers in the corpus with a high volume of
Twitter mentions, in chronological order of submission

          Twitter mention       arXiv download
    id    delay    span         delay    span
    1     4        174          11       197
    2     35       96           43       183
    3     28       160          13       162
    4     8        143          14       141
    5     105      29           111      141
    6     6        141          5        141
    7     5        137          11       141
    8     21       50           25       127



As shown in Fig. 5, the article in question was submitted to arXiv on October 14, 2010. In
the diagram, time runs horizontally from left to right. Downloads and Twitter mentions on a
given day are marked by vertical bars (weekly for downloads, daily for mentions), the color of
which indicates the volume of activity: black bars indicate the highest level, whereas lower levels
are indicated by increasingly lighter shades of gray. As Fig. 5 shows, the Twitter response to
submission (bottom rectangle) occurs within a day, reaches a peak within several days, and then
slowly dies out over the course of the following week. The peak of arXiv downloads, however,
occurs two weeks after submission, and download activity does not die out afterward but
continues for months, with lesser peaks at 11, 16, 19, 22, 24, and 26 weeks. From a post
hoc, ergo propter hoc point of view, in this case the Twitter response occurs immediately and
nearly exactly before the peak in arXiv reads, suggesting that social media attention may have
led to subsequently higher levels of arXiv downloads.



[Figure 5 graphic: two horizontal timelines (arXiv downloads and Twitter mentions) running
from October 2010 to April 2011, with markers for the article submission, the Twitter mention
spike, and the arXiv download spike.]



Figure 5. Details of delay and time span for an arXiv preprint. 



The results described above are, however, observed for very small samples of articles that
received very high levels of Twitter mentions. In the following section, we examine whether the
numbers of Twitter mentions and arXiv downloads are indeed correlated over our cohort of arXiv
submissions, and whether either has a visible correlation with subsequent citations.

3.2 Correlation between article downloads, Twitter mentions, and citations 

We investigate whether Twitter mentions can influence downloads of an article and its citation
impact, i.e. we study two correlations:

T ~ A: Twitter mentions T vs. arXiv downloads A.

T ~ C: Twitter mentions T vs. citations C.

As we noted above, the frequency distribution of Twitter mentions is highly skewed towards
low values, as shown in Figure 3. To remedy the effects of such a highly skewed distribution,
we adopt a threshold N which represents the number of most frequently mentioned articles on
Twitter. We subsequently calculated the correlation coefficients T ~ A and T ~ C at various
levels of N. The calculated correlation coefficients for various values of N in the interval [3, 100]
are listed in Table 3. The relationship between Twitter mentions and both downloads and
citations for the 100 most mentioned arXiv articles on Twitter is also depicted in the scatter
plots of Figure 6.

Table 3. Correlation coefficients of Twitter mentions vs. arXiv downloads (T ~ A) and
Twitter mentions vs. citations (T ~ C) for top articles with error margins

    N      T ~ A              T ~ C
    3      0.203 ± 0.678      0.893 ± 0.143
    20     0.205 ± 0.219      0.667 ± 0.127
    60     0.381 ± 0.111      0.521 ± 0.095
    100    0.391 ± 0.085      0.330 ± 0.089
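
The thresholding procedure behind Table 3 (rank articles by Twitter mentions, keep the top N, then correlate) can be sketched as follows. This is an illustration under stated assumptions: it uses the Pearson coefficient on raw counts, while the text does not say which coefficient was used, whether counts were log-transformed, or how the error margins were derived. The names `pearson` and `top_n_correlation` and the records are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def top_n_correlation(records, n):
    """Correlate mentions with a second measure for the n most mentioned.

    records -- list of (mentions, other_measure) tuples, one per article
    """
    top = sorted(records, key=lambda r: r[0], reverse=True)[:n]
    return pearson([r[0] for r in top], [r[1] for r in top])


# Hypothetical (mentions, citations) records for five articles.
records = [(10, 20), (8, 16), (5, 10), (1, 2), (1, 50)]
r_top3 = top_n_correlation(records, 3)
r_all = top_n_correlation(records, 5)
print(round(r_top3, 3), round(r_all, 3))
```

In this made-up data the three most mentioned articles are perfectly correlated, while including the rarely mentioned ones lowers the coefficient, which is the kind of dependence on N that Table 3 reports.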



We find that the correlation T ~ C increases as N is lowered, i.e. Twitter mentions and
citations are more highly correlated when only the N most frequently mentioned articles are
considered. This indicates that the most frequently mentioned articles on Twitter exhibit higher
correlations with their citation counts. Specifically, when only the 20 most mentioned articles
are selected vs. the 100 most mentioned, the correlation coefficient T ~ C increases from 0.330
to 0.667.

A notable pattern is that the T ~ A correlation does not increase when only the most
mentioned articles are taken into consideration. For N = 100 we find a T ~ A correlation
of 0.391 whereas for N = 20 we find T ~ A = 0.205. In other words, Twitter mentions and
arXiv downloads are correlated to some degree, but across all articles, and regardless of their
frequency of mention. This correlation disappears as our analysis is focused only on the N most
mentioned articles, meaning that even the most infrequently mentioned articles are significant in
understanding the relation between Twitter mentions and arXiv downloads for the same article.

The scatter plots of Figure 6 depict the relation between log Twitter mentions and log
downloads, i.e. log(T) ~ log(A) in Fig. 6(a), and between log Twitter mentions and log citations,
i.e. log(T) ~ log(C) in Fig. 6(b), for the top 100 most mentioned articles on Twitter (N = 100).
The scatter plots show a weak but significant positive correlation for T ~ A, where r = 0.391
and $R^2$ = 0.153, and for T ~ C, where r = 0.330 and $R^2$ = 0.110. A visual inspection indicates
that the scatter is very high in both cases, but less so for the log(T) vs. log(A) scatter plot in
Figure 6(a).



4 Discussion 



Until a few decades ago, the dissemination of scientific knowledge was, by and large, a mechanism 
run by scientists and dedicated news outlets. In this virtually closed ecosystem, the de facto 
currency was the citation, the official attestation of scholarly credit and recognition. More 
recently, the ecosystem has opened up considerably. The public, not just dedicated scientists, 
is enjoying in recent years increased levels of free and open access to much of the scientific 
literature via a myriad of online services, such as bibliographic repositories, data archives, and 
science blogs. In an online open medium, new possibilities for measuring impact emerge, but it 
becomes more difficult to determine the communities that are driving different forms of impact. 
For example, one may think that usage data, measured as volume of downloads, may be biased
towards the interests and preferences of the general public. As such, usage data would reflect
merely the popular, online appeal of a given publication whereas citation data would reflect its
actual scientific merits or impact.

Figure 6. Scatter plot of Twitter mentions vs. downloads (a) and citations (b) for 100 most
mentioned articles on Twitter. [Axis label in both panels: log(Twitter mentions).]

It is true that, to a certain extent, the shape of communities reading scientific material can 
be approximated and predicted in different ways. In recent work, in the field of astrophysics, it 
has been noted that the expected distributions of downloads break down depending on whether 
the origin of the incoming download is an astrophysics database (ADS) or a general search 
engine (Google), suggesting the existence of two distinct communities whose download behavior 
reflects different motivations and conceptions of scholarly importance [9]. Yet, reconstructing
such "usage" communities with accuracy is an arduous task, if a useful one at all. We believe
that online communities are necessarily overlapping, and that as science and its dissemination 
mechanisms progressively move to the web, they will naturally employ the metrics of impact 
and recognition of the web. 

The research presented in this paper is based on data from two sources which theoretically 
serve two different communities: arXiv.org, with its strong hard science focus, is targeted to 
scientists; Twitter, one of the largest social networks of the day, serves the public at large. 
Yet, both services are open to all, as they offer open access to collections of preprints, and 
user-generated tweets, respectively. Clearly, arXiv and Twitter serve both the general public 
and the scholarly community, but each to a lesser vs. greater degree. But in this study we 
did not attempt to conceptualize arXiv downloads solely as scientific impact, and Twitter
mentions solely as public chatter. Rather, we measured the correlation and temporal differences 
between these forms of responses, working under the assumption that these services naturally 
have overlapping user communities. 

Our results, though preliminary, are highly suggestive of a strong tie between social media
interest, downloads and even citation practices. We find that Twitter mentions and arXiv
downloads of scholarly articles follow two distinct temporal patterns of activity, with Twitter
mentions having shorter delays and narrower time spans than arXiv downloads. Moreover,
we find that a high volume of Twitter mentions is statistically correlated with a high volume of
downloads and, even more, a high volume of "immediate" citations, i.e., citations in the scholarly
record occurring just months after the publication of a preprint.

The most immediate explanation for our results is that, as scholars are increasingly exposed
to social media such as Twitter, their scholarly download and citation behavior is unavoidably 
affected. A paper submitted to arXiv that happens to receive high levels of mentions in social 
media will, as a result, receive greater exposure among both the general public and scholars. 
As a consequence, it will receive greater levels of scholarly interest, and higher volumes of 
downloads and subsequent citations. Our results indeed indicate that early Twitter mentions of 
a paper seem to lead to more rapid and more intense download levels and subsequently higher 
citation levels. An alternative, and equally plausible, explanation for our results lies in the 
inherent differences between various manuscripts, in terms of their quality or popular appeal. 
Manuscripts of greater quality or appeal, either among the public or the scholarly community, 
will by virtue of this characteristic enjoy higher levels of mentions on Twitter, higher levels of 
downloads on arXiv, and higher levels of later citations. 

We acknowledge that these observations can be the result of a number of distinct or
overlapping factors which our methodology confounds and fails to distinguish. Later research will
focus on unraveling the potential causal mechanisms that tie the various factors together, and 
might shed light on whether and how social media is gradually becoming a crucial component 
of academic and scholarly life. 

5 Materials 

5.1 Abbreviations 

Table 4 presents a list of the subject domain abbreviations used in this article.

5.2 Data collection 

Our process of determining whether a particular arXiv article was mentioned on Twitter consists
of three phases: crawling, filtering, and organization. Tweets are acquired via the Streaming
API from the Twitter Gardenhose, which represents roughly 10% of the total tweets from the public
timeline through random sampling. We collected tweets whose date and time stamps range
from 2010-10-01 to 2011-04-30, which results in a sample of 1,959,654,862 tweets.

The goal of the data filtering process is to find all tweets that contain a URL that directly
or indirectly links to any arXiv.org paper. However, determining whether a paper has or has
not been mentioned on Twitter is fraught with a variety of issues, the most important of which
is the prevalence of partial or shortened URLs. Twitter imposes a 140 character limit on the
length of tweets, and users therefore employ a variety of methods to replace the original article
URLs with alternative or shortened ones. Since many different shortened URLs can point to
the same original URL, we resolve all shortened URLs in our Twitter data set to determine
whether any of them point to the articles in our arXiv cohort.

Table 4. List of abbreviations for arXiv.org subject domains

    Subject Abbr.    Description
    astro-ph         Astrophysics
    hep              High Energy Physics
    physics          Physics
    math             Mathematics
    cond-mat         Material Science
    cs               Computer Science
    quant-ph         Quantum Physics
    gr-qc            General Relativity and Quantum Cosmology
    nucl             Nuclear
    q-bio            Quantitative Biology
    math-ph          Mathematical Physics
    nlin             Nonlinear Science
    stat             Statistics
    q-fin            Quantitative Finance

We distinguish between four general types of scholarly mentions in Twitter, based on whether 
they contain: 

1. a URL that directly refers to a paper published in arXiv.org. 

2. a shortened URL that upon expansion refers to an arXiv.org paper.

3. a URL that links to a web page, e.g. a blog posting, which itself contains a URL that 
points to an arXiv.org paper. 

4. a shortened URL that links to a type (3) mention after expansion. 

In order to detect these four types of Twitter mentions, we first expand all shortened URLs
in our crawled public tweets. We select the 16 most popular URL shortening services, including
bit.ly, tinyurl.com, and ow.ly, and expand the shortened URLs in our collection of tweets using
their respective APIs. As such, we resolved 98,377,880 short URLs, which were mostly generated
by the following URL shorteners: bit.ly (61.3%), t.co (15.2%), fb.me (6.5%), tinyurl.com (6.1%)
and ow.ly (4.4%). (We acknowledge that this procedure will not identify all Twitter mentions
of a given arXiv.org paper, but it will capture most.) From the resulting set, we retain
all tweets that contain the term 'arXiv' and at least one URL. Next, we associate tweets with
arXiv papers by extracting the arXiv ID (substrings matching 'dddd.dddd') from any papers
mentioned in those tweets. (Note that in the case of the third and fourth type of Twitter mention
the arXiv paper ID is not explicitly shown in the tweet itself, but needs to be extracted from
the web pages that the tweet in question links to.)
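
The filtering and ID extraction steps for mention types 1 and 2 can be sketched as below. This is a simplified illustration, not the pipeline used in the study: URL expansion is replaced by a lookup in a pre-resolved mapping (`expanded_urls`, a hypothetical stand-in for the shortener APIs), mention types 3 and 4 (links via intermediate web pages) are not handled, and the function name, regular expressions, and sample tweets are assumptions.

```python
import re

# 'dddd.dddd' pattern for arXiv IDs, not embedded in a longer digit run.
ARXIV_ID = re.compile(r"(?<!\d)(\d{4}\.\d{4})(?!\d)")
URL = re.compile(r"https?://\S+")


def arxiv_ids_in_tweet(text, expanded_urls=None):
    """Return the set of arXiv IDs mentioned in a tweet.

    text          -- raw tweet text
    expanded_urls -- hypothetical dict mapping a shortened URL to its
                     resolved target (stands in for shortener API calls)
    """
    expanded_urls = expanded_urls or {}
    urls = URL.findall(text)
    if not urls:
        return set()  # a mention requires at least one URL
    resolved = [expanded_urls.get(u, u) for u in urls]
    haystack = " ".join([text] + resolved)
    if "arxiv" not in haystack.lower():
        return set()  # keep only tweets containing the term 'arXiv'
    return set(ARXIV_ID.findall(haystack))


# Hypothetical short link and its resolved target.
lookup = {"http://bit.ly/abc123": "http://arxiv.org/abs/1010.3003"}
print(arxiv_ids_in_tweet("Cool paper http://bit.ly/abc123", lookup))
print(arxiv_ids_in_tweet("No link, just arXiv 1010.3003"))
```

The second call returns an empty set because the tweet contains no URL, which reflects the requirement above that a mention include at least one link.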